Add hi-IN, ko-KR and pt-BR IPA tokenizer support #15567
quapham wants to merge 19 commits into NVIDIA-NeMo:main
Can you fix the linting and sign-off issues?
Pull request overview
Extends the NeMo TTS IPA G2P/tokenizer stack to better handle additional scripts/locales (Hindi with English code-switching, Korean, and Brazilian Portuguese) by expanding tokenization character coverage, dictionary parsing, and adding unit tests to validate expected IPA outputs.
Changes:
- Added unit tests for IPA tokenization in pt-BR, hi-IN (Hindi/English code-switching), and ko-KR.
- Expanded “any-locale” tokenization character coverage to include Indic and Korean Unicode ranges.
- Updated IpaG2p dictionary parsing and regex handling to accept Indic/Korean words and merge multiple dictionaries.
Reviewed changes
Copilot reviewed 4 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/collections/common/tokenizers/text_to_speech/test_tts_tokenizers.py | Adds unit tests and small in-test pronunciation dictionaries for pt-BR, hi-IN code-switching, and ko-KR. |
| nemo/collections/tts/g2p/models/i18n_ipa.py | Extends IpaG2p regex + dictionary parsing to support Indic/Korean and multi-dict merging for code-switching. |
| nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py | Adds Indic and Korean Unicode ranges and expands any-locale word tokenization regex accordingly. |
| nemo/collections/common/tokenizers/text_to_speech/ipa_lexicon.py | Adds pt-BR and ko-KR to supported locales and extends punctuation sets for hi-IN and ko-KR. |
  def __init__(
      self,
-     phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
+     # phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
+     phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]],
      locale: str = "en-US",
The phoneme_dict type annotation doesn't match the supported runtime behavior: the Hindi unit test passes a list of dicts for code-switching, but the annotation only allows List[Union[str, Path]]. This will trip static type checking and makes the API contract unclear; broaden the union to allow lists containing dicts (or use a Sequence[...]) and update the parameter docstring accordingly (also remove the stale commented-out type line).
I agree with Copilot's comment. We need to remove the stale commented-out type line and fix the typing. The same annotation appears in three places: __init__, _parse_phoneme_dict, and replace_dict.
The type List[Union[str, pathlib.Path]] doesn't reflect the actual runtime behavior. The Hindi test passes [self.PHONEME_DICT_HI, self.PHONEME_DICT_EN], which is a list of dicts. The recursive call in _parse_phoneme_dict handles this correctly at runtime, but the type annotation is misleading.
- def __init__(
-     self,
-     phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
-     # phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]],
-     phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]],
-     locale: str = "en-US",
+ def __init__(
+     self,
+     phoneme_dict: Union[
+         str, pathlib.Path, List[Union[str, pathlib.Path, Dict[str, List[List[str]]]]], Dict[str, List[List[str]]]
+     ],
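The broadened union in the suggestion is easier to read when it is split into named aliases. The sketch below is one way to do that; `PhonemeDict`, `PhonemeSource`, `PhonemeDictArg`, and `describe_source` are hypothetical names introduced here, not identifiers from the NeMo code base.

```python
import pathlib
from typing import Dict, List, Union

# Hypothetical aliases for illustration; the PR spells the union out inline.
PhonemeDict = Dict[str, List[List[str]]]  # word -> list of pronunciations
PhonemeSource = Union[str, pathlib.Path, PhonemeDict]  # a single dictionary source
PhonemeDictArg = Union[PhonemeSource, List[PhonemeSource]]  # one source or a list of sources


def describe_source(phoneme_dict: PhonemeDictArg) -> str:
    """Report which branch of the union a value falls into."""
    if isinstance(phoneme_dict, list):
        # Each list element is itself a single source (path or dict).
        return "list of " + ", ".join(describe_source(item) for item in phoneme_dict)
    if isinstance(phoneme_dict, dict):
        return "dict"
    if isinstance(phoneme_dict, (str, pathlib.Path)):
        return "path"
    raise TypeError(f"unsupported phoneme_dict type: {type(phoneme_dict).__name__}")
```

With aliases like these, the same annotation could be reused across `__init__`, `_parse_phoneme_dict`, and `replace_dict`, so the three signatures cannot drift apart. A list of dicts, as passed by the Hindi test, type-checks because `PhonemeSource` includes `PhonemeDict`.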
Signed-off-by: quanpham <youngkwan199@gmail.com>
Signed-off-by: quapham <quapham@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Force-pushed from 065c5cb to 2151454.
  @staticmethod
  def _parse_phoneme_dict(
-     phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]]
+     phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
-     phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
+     phoneme_dict: Union[
+         str, pathlib.Path, List[Union[str, pathlib.Path, Dict[str, List[List[str]]]]], Dict[str, List[List[str]]]
+     ],
- def replace_dict(self, phoneme_dict: Union[str, pathlib.Path, Dict[str, List[List[str]]]]):
+ def replace_dict(
+     self, phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
-     self, phoneme_dict: Union[str, pathlib.Path, List[Union[str, pathlib.Path]], Dict[str, List[List[str]]]]
+     self,
+     phoneme_dict: Union[
+         str, pathlib.Path, List[Union[str, pathlib.Path, Dict[str, List[List[str]]]]], Dict[str, List[List[str]]]
+     ],
  ) -> Dict[str, List[List[str]]]:
      """
-     parse an input IPA dictionary and save it as a dict object.
+     parse an input IPA dictionary (or multiple) and save it as a dict object.
-     parse an input IPA dictionary (or multiple) and save it as a dict object.
+     Parse one or more IPA dictionaries and return a merged dict object.
@@ -167,6 +174,14 @@ def _parse_phoneme_dict(
Args:
    phoneme_dict: A single phoneme dictionary source or a list of sources for multi-dictionary
        code-switching (e.g. Hindi + English). Each source can be:
        - a file path (str or pathlib.Path) in CMUdict format,
          e.g. ``scripts/tts_dataset_files/ipa_cmudict-0.7b_nv22.06.txt``
        - a dict object with CMUdict-like entries,
          e.g. ``{"Wire": [["ˈ", "w", "a", "ɪ", "ɚ"], ["ˈ", "w", "a", "ɪ", "ɹ"]]}``
        When a list is provided, all sources are parsed and merged into a single dictionary.
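The merge described in this docstring can be sketched for in-memory dictionaries as follows. This is an illustration under stated assumptions, not the NeMo implementation: `merge_phoneme_dicts` is a hypothetical helper, file-path sources are left out, and the collision rule (keep every distinct pronunciation in first-seen order) is an assumption because the thread does not say how `_parse_phoneme_dict` resolves duplicate words. The Hindi entry is a made-up placeholder rather than a line from a real lexicon.

```python
from typing import Dict, List, Union

PhonemeDict = Dict[str, List[List[str]]]  # word -> list of pronunciations


def merge_phoneme_dicts(sources: Union[PhonemeDict, List[PhonemeDict]]) -> PhonemeDict:
    """Merge one or more in-memory dictionaries into a single new dict."""
    if isinstance(sources, dict):
        sources = [sources]  # treat a single dict as a one-element list
    merged: PhonemeDict = {}
    for source in sources:
        for word, prons in source.items():
            known = merged.setdefault(word, [])
            for pron in prons:
                # Assumption: keep each distinct pronunciation once, in first-seen order.
                if pron not in known:
                    known.append(list(pron))
    return merged


# Placeholder Hindi entry for illustration; the English entry is the docstring's example.
dict_hi = {"नमस्ते": [["n", "ə", "m", "ə", "s", "t", "e"]]}
dict_en = {"Wire": [["ˈ", "w", "a", "ɪ", "ɚ"], ["ˈ", "w", "a", "ɪ", "ɹ"]]}
merged = merge_phoneme_dicts([dict_hi, dict_en])
```

Because the helper copies each pronunciation list, later edits to `merged` do not mutate the input dictionaries.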
Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
@quapham I made some suggestions on the PR. Please apply them if you agree they are correct. Thanks!
FYI, I changed the unit tests directly so that they cover more cases.
…kenizers.py Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
Important
The Update branch button must only be pressed on very rare occasions. An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
Extends the IpaG2p tokenizer to support the Hindi (hi-IN, with English code-switching), Korean (ko-KR), and Brazilian Portuguese (pt-BR) locales.
Collection: tts, common
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information